[Feature] Add default init container in workers to wait for GCS to be ready #973
Conversation
Force-pushed from 5222472 to 139bc5b
Looks good.
There's a slight concern that some users may require additional configuration (security policy, etc.) to be copied over to the init container.
I wonder if we should copy the entire container spec and replace the entry point -- that could have unforeseen consequences for some users, though.
( OSS is hard :) )
Examined the implementation of the btw, looks like |
Add default init container in workers to wait for GCS to be ready (ray-project#973)
Why are these changes needed?
Currently, the init container logic is wrong: it waits for the head service rather than for the GCS server. The head service becomes ready as soon as the image pull finishes, and after that the only retry logic is implemented internally by Ray.
See kuberay/ray-operator/config/samples/ray-cluster.complete.yaml, lines 124 to 129 at commit 71e260f, for the original init container.
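The embedded snippet does not render here. For context, a representative version of the old init container from that sample YAML (a sketch from memory, so the exact content of lines 124 to 129 may differ) looks like:

```yaml
initContainers:
  # Old behavior (approximate): wait only until the head service DNS name resolves,
  # which says nothing about whether the GCS server inside the head Pod is ready.
  - name: init-myservice
    image: busybox:1.28
    command: ['sh', '-c', "until nslookup $RAY_IP.$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace).svc.cluster.local; do echo waiting for myservice; sleep 2; done"]
```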
For example, add `command: ["sleep 180"]` to the headGroupSpec; the head Pod command then becomes `sleep 180 && ulimit -n 65536; ray start ...`. To clarify, the GCS server requires a minimum of 120 seconds to become ready after the head service is ready. This exceeds the timeout of Ray's internal retry mechanism, so the worker will fail.
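For illustration only, a headGroupSpec fragment that reproduces the failure might look like the following; the container name and image tag are placeholders, not values taken from the sample:

```yaml
headGroupSpec:
  template:
    spec:
      containers:
        - name: ray-head                 # placeholder name
          image: rayproject/ray:2.3.0    # placeholder image tag
          # KubeRay prepends this to the generated start command, so the head
          # container effectively runs: sleep 180 && ulimit -n 65536; ray start ...
          command: ["sleep 180"]
```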
In this PR, we add a default init container that uses `ray health-check` to check the status of GCS, which prevents this issue. In addition, each init container must complete successfully before the next one starts, so it is fine to have two init containers for now to keep backward compatibility. We will remove the original init container from the sample YAML files after release 0.5.0.
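As a rough sketch of the idea (not the exact container the operator injects; the service name, port, and polling interval are assumptions), the new default init container behaves like:

```yaml
initContainers:
  # New behavior: block the worker's Ray container until the GCS server itself
  # responds to a health check, not merely until the head service resolves.
  - name: wait-gcs-ready
    image: rayproject/ray:2.3.0   # assumed to reuse the worker's Ray image
    command: ["/bin/bash", "-c"]
    args:
      - |
        until ray health-check --address raycluster-complete-head-svc:6379 > /dev/null 2>&1; do
          echo "Waiting for GCS to be ready."
          sleep 5
        done
```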
Related issue number

Closes #476
Checks